Document Selection Using MapReduce
Authors
Abstract
Similar resources
Ontology Based Document Clustering Using MapReduce
Nowadays, document clustering is considered as a data intensive task due to the dramatic, fast increase in the number of available documents. Nevertheless, the features that represent those documents are also too large. The most common method for representing documents is the vector space model, which represents document features as a bag of words and does not represent semantic relations betwe...
Feature Selection in High-Dimensional Dataset Using MapReduce
This paper describes a distributed MapReduce implementation of the minimum Redundancy Maximum Relevance algorithm, a popular feature selection method in bioinformatics and network inference problems. The proposed approach handles both tall/narrow and wide/short datasets. We further provide an open source implementation based on Hadoop/Spark, and illustrate its scalability on datasets involving ...
Web Document Clustering Using Threshold Selection Partitioning
Clustering techniques have been applied to categorize documents on World Wide Web. In previous research, PDDP (Principal Direction Divisive Partitioning) is a well-known clustering algorithm. PDDP algorithm employs top-down and unsupervised clustering based on the principal component analysis and splits documents into two sets using a plane perpendicular to the maximum principal direction passi...
A MapReduce Relational-Database Index-Selection Tool
The physical design of data storage is a critical administrative task for optimizing system performance. Selecting indices properly is a fundamental aspect of the system design. Index selection optimization has been widely studied in DataBase Management Systems (DBMSs). However, current DBMS are not appropriate platforms for many data nowadays. As a result, several systems have been developed t...
Design and Implement of Distributed Document Clustering Based on MapReduce
In this paper, we describe how document clustering for large collection can be efficiently implemented with MapReduce. Hadoop implementation provides a convenient and flexible framework for distributed computing on a cluster of commodity machines. The design and implementation of tfidf and K-Means algorithm on MapReduce is presented. More importantly, we improved the efficiency and effectivenes...
Journal
Journal title: International Journal of Security, Privacy and Trust Management
Year: 2015
ISSN: 2319-4103,2277-5498
DOI: 10.5121/ijsptm.2015.4401